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Abstract 

Let X be a d dimensional vector of covariates and Y be the response variable. Under 
the nonparametric model Y = m(X) + er(X)e we develop an ANOVA-type test for the null 
hypothesis that a particular coordinate of X has no influence on the regression function. 
The asymptotic distribution of the test statistic, using residuals based on Nadaraya- Watson 
type kernel estimator and d < 4, is established under the null hypothesis and local alterna- 
tives. Simulations suggest that under a sparse model, the applicability of the test extends to 
arbitrary d through sufficient dimension reduction. Using p- values from this test, a variable 
selection method based on multiple testing ideas is proposed. The proposed test outper- 
forms existing procedures, while additional simulations reveal that the proposed variable 
selection method performs competitively against well established procedures. A real data 
set is analyzed. 

Keywords: Nonparametric regression; kernel regression; Lack-of-fit tests; Dimension 
reduction; Backward elimination. 
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1 Introduction 



For a response variable Y and a d dimensional vector of the available covariates X set 
m(X) = E{Y\X). The dual problems of testing for the significance of a particular covariate, 
and identification of the set of relevant covariates are very common both in applied research 
and in methodological investigations. Due to readily available software, these tasks are 
often performed under the assumption of a linear model, m(X) = X/3. Model checking 
fits naturally in the methodological context of hypothesis testing, while variable selection is 
typically addressed through minimization of a constrained or penalized objective function, 
such as Tibshirani's (1996) LASSO, Fan and Li's (2001) SCAD, Efron, Hastie, Johnstone 
and Tibshirani's (2004) least angle regression, Zou's (2006) adaptive LASSO, and Candes 
and Tao's (2007) Dantzig selector. 

At a conceptual level, however, the two problems are intimately connected: dropping 
variable j from the model is equivalent to not rejecting the null hypothesis Hq : (3j = 0. 
Abramovich, Benjamini, Donoho and Johnstone (2006) bridged the methodological divide by 
showing that application of the false discovery rate (FDR) controlling procedure of Benjamini 
and Hochberg (1995) on p values resulting from testing each H 3 Q can be translated into 
minimizing a model selection criterion of the form 



where S is a subset of {1,2, ... ,d} specifying the model, /3f denotes the least squares esti- 
mator from fitting model S, \S\ is the cardinality of the subset S, and the penalty parameter 
A depends both on d and \S\. This is similar to penalty parameters used in Tibshirani and 
Knight (1999), Birge and Massart (2001) and Foster and Stine (2004), which also depend on 
both d and \S\, and more flexible than the proposal in Donoho and Johnstone (1994) which 
uses A depending only on d, as well as AIC and Mallow's C p which use constant A. 
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Working with orthogonal designs, Abramovich et al. (2006) showed that the global min- 
imum of the penalized least squares ([1]) with the FDR penalty parameter is asymptotically 
minimax for i r loss, < r < 2, simultaneously throughout a range of sparsity classes, 
provided the level q for the FDR is set to q < 0.5. Generalizations of this methodology to 
non-orthogonal designs differ mainly in the generation of the p values for testing Hq : j3j = 0, 
and the FDR method employed. Bunea, Wegkamp and Auguste (2006) use p values gener- 
ated from the standardized regression coefficients resulting from fitting the full model and 
employ Benjamini and Yekuteli's (2001) method for controlling FDR under dependency, 
while Benjamini and Gavrilov (2009) use p values from a forward selection procedure where 
the ith stage p-to-enter is the ith stage constant in the multiple-stage FDR procedure in 
Benjamini, Krieger and Yekutieli (2006). 

Model checking and variable selection procedures based on the assumption that the re- 
gression function is linear may fail to discern the relevance of covariates whose effect on 
m(x) is nonlinear. See Tables |2] and El Because of this, procedures for both model checking 
and variable selection have been developed under more general/flexible models. See, for 
example Li and Liang (2008), Wang and Xia (2008), Huang, Horowitz and Wei (2010), Stor- 
lie, Bondell, Reich and Zhang (2011), and references therein. However, the methodological 
approaches for variable selection under these more flexible models have been distinct from 
those of model checking. 

This paper aims at showing that a suitably flexible and powerful nonparametric model 
checking procedure can be used to construct a competitive nonparametric variable selection 
procedure by exploiting the aforementioned conceptual connection between model checking 
and variable selection. Thus, this paper has two objectives. The first is to develop a proce- 
dure for testing whether a particular covariate contributes to the regression function in the 
context of a heteroscedastic nonparametric regression model. The second objective is to pro- 
pose a variable selection procedure based on backward elimination using the Benjamini and 
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Yekuteli (2001) method applied on the d p-values resulting from testing for the significance 
of each covariate. 

In Section [2j we formally describe the model and introduce the hypothesis, the test 
statistic and its asymptotic distribution under the null hypothesis and local alternatives. 
Section [3] presents the results of a simulation study where the performance of the proposed 
test statistic is compared to those of existing tests. In Section @] the proposed variable 
selection procedure is described and compared, in simulation studies and a real data set, to 
well established variable selection methods. 

2 Nonparametric Model Checking 

2.1 The Hypothesis and the Test Statistic 

Let Y be the response variable and X = (X 1; . . . , X^) the vector of available covariates. 
Set m(X) = E(Y\X.) for the regression function and define 



From its definition if follows that E((\X) = E(() = E(C\Xj) = 0, for all j = 1, . . . , d. Setting 
cr 2 (X) = Var(£|X), we have the model 



where e is the standardized error £. Based on a sample (Yj, X*),i = 1, . . . ,n, of iid obser- 
vations from model (j3J), we will consider testing the hypothesis that the regression function 
does not depend on the jth covariate. For simplicity in notation we set X = (Xi, X2), where 
Xi is of dimension (d — 1) and X2 is univariate. Setting E(Y\X.{) = mi(Xi) the hypothesis 
we will consider can be written as 



C = y-m(X). 



(2) 



Y = m(X) + cr(X)e, 



(3) 



H Q : m(xi,x 2 ) = mi(xi). 



(4) 
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To fully appreciate the nature of this hypothesis, let Fx x , Fx 2 denote the marginal dis- 
tribution functions of X l5 X 2 , respectively, and consider the ANOVA-type decomposition 

m(Xi,X 2 ) = /U + mi(Xi) + m 2 (X 2 ) + m 12 (X 1 ,X 2 ), (5) 

where (J. = f f m(x x , x 2 )dF Xl (xi)cLFx 2 (x 2 ), mi(xi) = / m(xi, x 2 )dFx 2 (x 2 ) - m 2 (x 2 ) = 
/ m(xi, x 2 )c?F Xl (xi) - /i, mi 2 (xi,x 2 ) = m(x 1; x 2 ) — H — mi(xi) - m 2 (x 2 ). Note that 
their definition implies J m 1 (x 1 )iiFx 1 (xi) = J" m 2 (x 2 )dF X2 (x 2 ) = f m 12 (x!, x 2 )c?Fx 1 (x 1 ) = 
/ mi 2 (xi,x 2 )dFx 2 (x 2 ) = 0. 

Under the null hypothesis (j!J) it further follows that 

mi(xi) = /i + mi(xi), fn 2 (X 2 ) = mi 2 (X 1) X 2 ) = 0. 

In the case that Xi,X 2 are independent, we also have i?(y|X 2 ) = \x under the null. 
Let now mi(Xii) = £ , (y|Xn), as before, and define the null hypothesis residuals as 

ei = ii-mi(Xii). (6) 

Since under the null hypothesis (BJ mi(Xij) = m(Xj), it follows that the null hypothesis 
residuals in ([6]) equal the residuals defined in (j2J) and thus 

E^\X 2l )=0. (7) 

The idea for constructing the test statistic is to think of the as data from a high- dimensional 
one-way ANOVA design with levels x 2 i, % — 1, . . . , n. Because of (0), it follows that under the 
null hypothesis (j3J) there are no factor effects, and we can use the high-dimensional one-way 
ANOVA statistic of Akritas and Papadatos (2004) for testing (j4]), after dealing with two 
important details. First, mi is not known and needs to be estimated. Second, the statistic 
of Akritas and Papadatos (2004) requires two or more observations per cell, but in regression 
designs we typically have only one response per covariate value. 
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To deal with the unknow m\ will use the Nadaraya- Watson kernel estimator, 

*i(x u ) = ± ( v fV X rx" Xl x s ) Y - «' = x > ■ ■ ■ ' n ' ^ 

with K# n (x) = \H n \~ 1 K(H~ 1 x), where K(-) is a bounded (d — l)-variate kernel function 
of bounded variation and with bounded support, and H n is a symmetric positive definite 
(d — 1) x (d — 1) matrix called the bandwidth matrix. Set 

& = - mi(X H ) 

for the estimated null hypothesis residuals. 

To deal with the requirement of more than one observation per cell we make use of 
smoothness conditions and augment each cell by including additional p — 1 £/s which corre- 
spond to the (p — l)/2 X21 values that are nearest to X 2 \ on either side. To be specific, we 
consider the (£j,X 2 j), % = 1, . . . ,n, arranged so that X 2 i 1 < X 2i2 whenever i\ < i 2 , and for 
each X 2i , (p — l)/2 < i < n — (p — l)/2, define the nearest neighbor window as 

W = jj : |^x 2 (X 2 ,) - F X2 (X 2l )| < ^-1} , (9) 

where Fx 2 is the empirical distribution function of X 2 . Wi defines the augmented cell 
corresponding to X 2i . Note that the augmented cells are defined as sets of indices rather 
than as sets of values. The vector of (n — p + l)p constructed "observations" in the 
augmented one-way ANOVA design is 

kv = (ijJ e ^( P -i)/2+i, • • • , e W n - {p - 1)/2 )'. (10) 

Let MST = MST(£ y ), MSE = MSE(£ y ) denote the balanced one-way ANOVA mean 
squares due to treatment and error, respectively, computed on the data £ v . The proposed 
test statistic is based on 

MST -MSE. (11) 
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2.2 Asymptotic results 



2.2.1 Asymptotic null distribution 

Theorem 2.1. Assume that the marginal densities /x i; fx 2 °f Xi, X 2; respectively, are 
bounded away from zero, the second derivatives of fx t and mi(x) are uniformly continuous 
and bounded, that a 2 (.,X2) '■— -E(£ 2 |X 2 = x 2 ) is Lipschitz continuous, sup x cr 2 (x) < 00, and 
E(ef) < 00. Assume that the eigenvalues, Aj, i = 1, . . . ,d — 1, of the bandwidth matrix H n 
defined in (TJj], converge to zero at the same rate and satisfy 

nA 2(d_1) 

nA, 8 ->■ and — — — ->• 00, z = 1, . . . , d - 1. (12) 
(log n) 2 



Then, under H in O), £/ie asymptotic distribution of the test statistic in ( TO]] groen fry 

n 1/2 (MST - MSE) 4 7V(0, 9 



3(p-l) " 

w /i ere r = / [/ a 2 (cci, x 2 )/x 1 |x 2 =x 2 (*i)^i] 2 fx 2 {x 2 )dx 2 . 

An estimate of r 2 can be obtained by modifying Rice's (1984) estimator as follows 

1 n-2 

^ 2 = ^T^y Efe " 4--i) 2 (6 +2 - 6 + i) 2 - (is) 

The next subsection gives the asymptotic theory under local additive and under general 
local alternatives. As these limiting results show, the asymptotic mean of the test statistic 
MST — MSE is positive under alternatives. Thus, the test procedure rejects the null 
hypothesis for "large" values of the test statistic. 

2.2.2 Asymptotics under local alternatives 

The local additive alternatives and the general local alternatives are of the form 

: m(x.ux 2 ) = m^xx) + p n m 2 (x 2 ), (14) 
iff : m(xi,x 2 ) = rrai(xi) + p ln m 2 (x 2 ) + p 2 n^i2(xi,x 2 ), (15) 



where the functions m 2 , fhn satisfy £ , (m 2 (X 2 )) = = E (mi2(xi, X 2 )) and p n = p\ n = 
arT x l^, p 2n = bn" 1 ^, for constants a, b. 

Theorem 2.2. Consider the notation and assumptions of Theorem \2.1[ Moreover, assume 
that 777.2 (x) is Lipschitz continuous. 



2. (Local General Alternatives^) Assume further that rfii2(xi, x 2 ) i> s Lipschitz contin- 
uous on X2 uniformly on x x . Then, under H± in (T5\). as n — >■ 00, 



/i G = P a 2 Var(m2(X 2 )y t pb 2 Var(m 1 2(X lj X 2 )}}-2pabCov(m2(X2) 7 m 1 2(X 1 ,X2)). If a = b 
the formula simplifies to p G = pa 2 V A ar(m 2 (X 2 )+mi 2 (X 1 , X 2 )). 

2.3 Practical considerations 

2.3.1 Using other estimators of m(xi) 

We conjecture that the asymptotic theory of the test statistic remains the same for a 
wide class of other nonparametric estimators of mi, such as local polynomial estimators or, 
under an additive model, the backfitting estimator. Moreover, if one is willing to assume 
additional smoothness conditions then, since use local polynomial estimators yields faster 
rates of convergence (Stone, 1982), we conjecture that the asymptotic theory of the test 
statistic based on such estimators can include covariate dimensionality greater than the 
present 4. A similar comment applies if one is willing to assume an additive model. 

An alternative version of the present kernel estimator incorporated in our simulations is 
a version of the estimator proposed by Newey (1994) and Linton and Nielsen (1995), and 



1. (Xocal Additive Alternatives^) Then, under 



in (fl^P , as n — > 00 , 



n 
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further studied in Mammen, Linton and Nielsen (1999) and Horowitz and Mammen (2004), 
computed as 



1 n 

mi(xi) = - } m(xi,X 2i ), 

71 < ■* 



n . 
i=i 



with m(xi, X 2i ) a Nadaraya-Watson kernel estimator of m(xi, x 2 ). Under the null hypothesis 
this also estimates E(Y\x.i), but under the alternative it estimates (see decomposition (JSJ)) 

mi(xi) = /i + mi(xi). 



Table 1: Rejection rates with alternative fitting methods 



Method 




9 = 

7 




9 = 

7 


1 







1 


2 


3 


4 


ANOVA-type 


(p=ll) 


.053 


.119 


.196 


.292 


.390 


ANOVA-type 


(P=9) 


.052 


.121 


.191 


.296 


.388 


ANOVA-type 


2 (p=ll) 


.054 


.150 


.218 


.315 


.440 


ANOVA-type 


2 (p=9) 


.054 


.122 


.201 


.320 


.422 



In contrast, under the alternative, mi(xi) estimates 

mi(xi) = ^t + mi(xi) + E(fh 2 (X 2 ) + m 12 (xi, A A 2 )|X 1 = xi). 

Thus, forming the residuals by £ = Y — mi(xi) inadvertently removes some of the effect of 
X 2 . The simulations reported in Table [U suggest that the test statistic using the residuals 
£ = Y — m 1 (x 1 ) (ANOVA-type2 in the table) can have improved power against non-additive 
alternatives. In this table, the data are generated according to the model Y = X\ + 9X 2 + 
^X\X 2 + e, with Xi,X 2 independent U(0,1) random variables and e ~ A^O^ 2 ). The 
reported rejection rates are based on 2000 simulation runs with n = 100. We conjecture 
that a similar alternative to the local polynomial estimator of m 1 will have improved power 
against non-additive alternatives. 
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2.3.2 Using dimension reducing techniques 



The conditions of Theorem 12.11 restrict d to be less than or equal to 4. However, under 
the assumption of a sparse model, the effects of the curse of dimensionality can be moderated 
through the use of dimension reduction techniques. In all simulations reported in Section fl~2l 
as well as the data analysis results of Section H~3l we used the classical sliced inverse regression 
(SIR) dimension reduction method of Li (1991). Moreover, we employed a variable screening 
method prior to applying SIR. The variable screening consists of performing the marginal 
test of Wang, Akritas and Van Keilegom (2008) for the significance of each variable, and 
keeping those variables for which the p-value is less than 0.5. 

3 Simulations: Model Checking Procedures 
3.1 Brief literature review 

Let X = (Xi, X2) be the vector of d available predictors, with Xi being c?i-dimensional. 
The problem of assessing the usefulness of X 2 , i.e. testing H : m(xi,x 2 ) = rai(xi), has 
been approached from different angles by many authors. The literature is extensive, so only 
a brief summary of some of the proposed ideas and the resulting test procedures is given 
below. For additional references see Hart (1997) and Racine, Hart and Li (2006). 

One class of procedures is based on the idea that the null hypothesis residuals, £ = 
Y — 77ii(Xx), satisfy i?(£|X) — under Hq and i?(£|X) — m(X) — mi(Xi) under the al- 
ternative. Thus, £'(£.E'(£|X)|X) = (m(X) — mi(Xi)) 2 under the alternative and zero under 
the null. Using this idea, Fan and Li (1996) propose a test statistic based on estimating 
E[^f 1 (X 1 )E^f 1 (X 1 )\X)f(X)} which equals E[(m(X) - mi(X0) 2 /i(X) 2 /(X)] under the al- 



io 



ternative and zero under the null. Their test statistic is 



-££/i(x u )] 

n z — ' 



5}&/i(Xy]jr 



3^1 



X-X 



where /i is the estimated density of X l5 is the estimated residuals under the null hy- 
pothesis, and K is a kernel function. Fan and Li (1996) show that their test statistic is 
asymptotically normal under H . Lavergne and Voung (2000) propose a test statistic based 
on different estimator of the same quantity as Fan and Li (1996), which is 
(n-4)! 



Ill 



Ew _ Yl)(Yl _ Yl)L „ (^) Ln (^) ,„ 



where ^ a is the sum over all permutations of 4 distinct elements chosen from n, L n = g~ dl L 
for a kernel L on R dl and K n = h~ d K for a kernel K on R d . Lavergne and Voung (2000) 
show that their test statistic is also asymptotically normal under H . 

A related class of procedures is based on direct estimation of E[(m(X.) — mi(Xi)) 2 W / (X)], 
for some weight function W. See, for example, Ait-Sahalia, Bickel, and Stoker (2001). The 
use of such test statistics is complicated by the need to correct for their bias. See also 
the bootstrap-based procedure of Delgado and Manteiga (2001). Because of the computer 
intensive nature of bootstrap-based procedures, these are not included in our comparisons. 

An additional class of test procedures uses alternatives based on Stone's (1985) addi- 
tive model. We will consider the procedure proposed by Fan and Jiang (2005). This is 
based on Fan, Zhang, and Zhang's (2001) Generalized Likelihood Ratio Test (GLR), using 
a local polynomial approximation and the backfitting algorithm for estimating the additive 
components. 

3.2 Numerical comparison 



In this section we compare the proposed ANOVA-type and ANOVA-type2 statistics de- 
scribed in Section [2.3. II to the statistics proposed by Lavergne and Vuong (2000) (LV in the 
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tables), Fan and Li (1996) (FL in the tables), and Fan and Jiang (2005) (GLR in the tables). 
The data is generated according to the models (also used in Lavergne and Vuong, 2000) 

Y=-X 1 + Xl + f 5 {X 2 ) + e, j = 0, 1, 2, 3, , 4, 5, 6, (16) 

where X l5 X 2 are iid N(0, 1) and e ~ iV(0, 4). Here, fo(x) = 0, which corresponds to the 
null hypothesis H : m(x 1 ,x 2 ) = m(x 1 ); fi(X 2 ) = .5X 2 , f 2 (X 2 ) = X 2 and h{X 2 ) = 2X 2 
give three linear alternatives, and fi{X 2 ) = sin{2nX 2 ), f^{X 2 ) = sin(nX 2 ), and fQ(X 2 ) = 
sin(2/37rX 2 ) give three non-linear alternatives. The kernel for the Nadaraya- Watson esti- 
mation of m(Xi) is the uniform on (—0.5, 0.5) density, and the bandwidth is selected through 
leave-one-out cross validation. The rejection rates shown in Table |2] for LV, FL, and F tests 
are taken from the simulation results reported in the LV paper (based on 2000 runs). It is 
important to note that, in each simulation setting, the LV paper reports several rejection 
rates for the LV and FL tests, each corresponding to different values of smoothing parame- 
ters. Since the best performing constants are different for different simulation settings, the 
rejection rates reported in Table [2] represent a) the most accurate alpha level achieved over 
all constants, and b) the best power achieved overall constants for each alternative. For 
comparison purposes, the rejection rates for the ANOVA-type tests and the GLR test are 
also based on 2000 simulation runs. 

As expected, the F test achieved the best results for the three linear alternatives and 
the worse results for the three non-linear alternatives. The GLR test has higher power than 
the ANOVA-type tests against linear alternatives (which is partly explained by the fact it 
is based on normal likelihood), but is much less powerful against the first of the non-linear 
alternatives. As the non-linearity decreases (f$ and Jq) the power of the GLR test improves. 

The GLR test is designed for additive models, which is exactly the simulation setting of 
Table [2j Under non-additive alternatives, however, it can perform poorly as indicated by the 
simulations reported in the first part of Table |3j These simulations use sample size n = 200 
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Table 2: Rejection rates under Ho, linear and non-linear alternatives 



linear sine 



11 


test 




fo 


fi 


h 


h 


h 


h 


h 


100 


LV 




.041 


.098 


.482 


.991 


.182 


.266 


.319 




FL 




.021 


.051 


.271 


.970 


.126 


.168 


.187 




ANOVA-type (p = 


9) 


.052 


.218 


.79 


.999 


.423 


.523 


.535 




ANOVA-type (p = 


7) 


.056 


.244 


.780 


.999 


.432 


.527 


.551 




ANOVA-type2 (p = 


= 9) 


.065 


.275 


.831 


1 


.453 


.598 


.600 




GLRT 




.044 


.365 


.951 


1 


.123 


.497 


.645 




F-test 




.051 


.695 


.997 


1 


.046 


.055 


.222 


200 


LV 




.054 


.208 


.875 


1 


.386 


.540 


.678 




FL 




.025 


.083 


.695 


1 


.289 


.395 


.471 




ANOVA-type (p = 


9) 


.055 


.374 


.95 


.999 


.73 


.778 


.788 




ANOVA-type (p = 


7) 


.051 


.376 


.979 


1 


.746 


.820 


.820 




ANOVA-type2 (p = 


:9) 


.069 


.487 


.999 


1 


.821 


.882 


.884 




GLRT 




.036 


.656 


1 


1 


.188 


.877 


.936 




F-test 




.052 


.931 


1 


1 


.051 


.053 


.340 



with data generated from the model Y = X 1 2 (l+9Xs)-\ — a -^- he, where e ~ N(0, 0.1), and 

Xi, X2, X3 are i.i.d. [7(0.5,2.5). The hypothesis tested is that m(Xi, X 2 , X3) = mi(Xi, X2). 
The residuals for the ANOVA-type test in the first part of Table [3] are based on a Nadaraya- 
Watson fit with kernel the uniform on (—0.5, 0.5) x (—0.5, 0.5) density and the common 
bandwidth selected through leave-one-out cross validation. 



Table 3: Rejection rates for non-additive and heteroscedastic models 





non 


-addit 


ive alternatives 

e 


heterocedast 


lie alternatives 
9 


test 





0.02 


.04 


.06 .08 


0.025 


.05 


.1 .2 


ANOVA-type (p = 9) 
GLR 


.052 
.048 


.176 
.082 


.609 
.110 


.940 .994 
.189 .304 


.053 .067 . 
.465 .511 . 


124 
624 


.485 .998 
.908 1 



Finally, it should be mentioned that the GLR test does not maintain its level under het- 
eroscedasticity. In simulations, reported in the second part of Table El under the additive but 
heteroscedastic model Y = X\ + 9 cos(ttX 2 ) + X 2 e, X x , X 2 i.i.d. N(0, 1), e ~ N(0, 0.5), using 
sample size n = 200, the GLR test is very liberal while the ANOVA-type test maintains an 



accurate level. 



4 From Model Checking to Variable Selection 

4.1 The proposed procedure 

In this section we will assume a sparse regression model in the sense that there exits 
a subset of indices Iq = {ji, ■ ■ ■ , jd } c {1, . . . , d} such that only the covariates Xj with 
j G Iq influence the regression function. Moreover, we will assume the dimension reduction 
model of Li (1991), i.e. m(x) = g(Bx), where B is a K x d matrix. In this context we will 
describe the following variable selection procedure using backward elimination based on the 
Benjamini and Yekuteli (2001) method for controlling the false discovery rate (FDR): 

1. Apply the variable screening procedure described in Section [2.3. 21 With a slight abuse 
of notation, the vector of the remaining covariates and its dimension will be denoted 
by x and d. 

2. Use SIR to obtain the estimator B. 

3. Obtain p- values from testing each of the hypotheses: 



H 3 : m(x) = mi(x ( _j)), j = l,...,d, 



(17) 



where x ( _ i} = (x x , x S - X , x j+1 , x d ): 



(a) Compute the test statistic (see Theorem 12. ip 




using residuals formed by a kernel estimator on the variables B(_j)X(_j), where 



B(_j) is the K x (d — 1) matrix obtained by omitting the jth column of B. 



(b) Compute the p-value for H 3 Q 



as TTj = 1 - $(Zj). 
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4. Compute 

I i a . . 

fe = max < 7 : 7rr,-i < — — 3 > (18) 

for a choice of level a, where 7T(i), . . . , 7T(d) are the ordered p- values. If fc = d stop and 
retain all variables. If k < d 

(a) update x by eliminating the covariate corresponding to 7T(<z), 

(b) update B by eliminating the column corresponding to the deleted variable, and 

(c) proceed to the next step. 

5. Repeat steps 131 and I3b| with the updated vx and B. 

Remarks. 1) Another approach for constructing a variable selection procedure is to use 
a single application of the Benjamini and Yekuteli (2001) method for controlling the false 
discovery rate (FDR). This is similar to one of the two procedures proposed in Bunea et 
al. (2006). However, this did not perform well in simulations and is not recommended. A 
backward elimination approach was Li, Cook and Nachtsheim (2005), but they did not use 
multiple testing ideas. 

2) Based on our simulation results, the variable screening part (Step 1) of the variable 
selection procedure does not improve the performance. However, it was included in the 
simulations as it reduces the computational time. 

4.2 Simulations: Variable selection procedures 

Because the ANOVA-type2 method (see Section [2. 3. ip is computationally more intensive, 
we used only the proposed variable selection method using the ANOVA-type test described 
in Section I2TTI with AN OVA cell sizes of 5 (when n = 40), 7, and 9 (when n = 110). The 
parameter a was set to 0.07 in Table H] (so the FDR is controlled at level (25 — 5)0.07/25 = 
0.056), and a = 0.06 in Table (so the FDR is controlled at levels 0.052 and 0.045). These 
procedures are compared with LASSO, SCAD, adaptive LASSO, the FDR-based variable 
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selection method proposed by Bunea, Wegkamp and Auguste (2005) (BWA in the tables), 
and a version of the BWA procedure which uses backward elimination (BWA+BE in the 
tables). The comparison criterion is the mean number of correctly and incorrectly excluded 
variables. All comparisons are based on 2000 simulated data sets. 



For LASSO we found that the R code in in http://cran.r-project.org/web/packages/ 
glmnet /index. html, with the lambda.lse option for selecting lambda, gave the best results; 
for adaptive LASSO we used the R code from |http:/ /wwwLstat .ncsu.edu/~boos/var.select 



/lasso. adaptive.html; for SCAD we used the function scadglm of the package SIS in R. 

In Table |4j data sets of size n = 110 were generated from the linear model Y = /3 T X + e, 
where e ~ N(0, 3 2 ), the dimension of X is d = 25, and 

(3 T = (3, 1.5, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0). 

The covariates are generated from a multivariate normal distribution with marginal means 
zero and covariances as shown in the table. It is seen that the proposed nonparametric 
variable selection procedures correctly exclude, on average, about 19.5 out of the 20 non- 
significant predictors. This is about as good as the procedures designed for linear models. 
The proposed procedures incorrectly exclude, on average, about 0.5 of the 5 significant pre- 
dictors, which is more than the other procedures (with the exception of BWA). 

Table 4: Comparisons using a linear model: d = 25, n = 110 



test 


E = 


= I 


S = 


(0.5l*- J 'l) 


correct 


incorrect 


correct 


incorrect 


SCAD 


19.48 


.026 


19.37 


.023 


LASSO 


18.29 


.005 


18.28 


.004 


Adaptive LASSO 


19.28 


.005 


19.26 


.025 


BWA 


19.99 


1.02 


19.97 


1.41 


BWA+BE 


19.55 


.001 


19.49 


.041 


ANOVA-type(p=7) 


19.46 


.63 


19.30 


.44 


ANOVA-type(p=9) 


19.52 


.65 


19.40 


.36 


data sets of size n = 


= 40 were 


generated from the models Y = 
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= 1, 2, where e ~ N(0, 0.3 2 ), the dimension of X is d = 8, and 

gi(x) = sin(7rxi), g 2 (x) = sin(3/47rxi) - 3$(-|x 5 | 3 ). 

The covariates are generated as normal with marginal means zero and covariance matrix 

Table 5: Comparisons using nonlinear models: d = 8, n = 40 

9i 92 



test correct incorrect correct incorrect 

SCAD 6\74 M 5TL L79 

LASSO 6.59 .92 5.72 1.80 

Adaptive LASSO 6.65 .95 5.62 1.73 

BWA 6.99 1 5.99 1.99 

BWA+BE 6.65 .94 5.70 1.75 

ANOVA-type(p=7) 6.21 0.001 5.75 .11 

ANOVA-type(p=5) 6.39 0.001 5.71 .08 



E = (0.5l i_J 'l). It is seen that the linear model based procedures fail to select the significant 
predictor(s) almost always. On the other hand, the proposed procedures always select the 
one relevant predictor under model g±, and exclude incorrectly, on average, about 0.08 out 
of the two important predictors under model g 2 . 



4.3 Real Data Example: Body Fat Dataset 

The Body Fat data is supplied by Dr. A. Garth Fisher for non- commercial purposes, 
and it can be found at "http://lib.stat.cmu.edu/datasets/bodyfat". The data set contains 
measurements of percent body fat (using Siri's (1956) method), Age (years), Weight (lbs), 
Height (inches), circunferences of Neck (cm), Chest (cm), Abdomen (cm), Hip (cm), Thigh 
(cm), Knee (cm), Ankle (cm), Biceps (cm), Forearm (cm) and Wrist (cm), from 252 men. 
The response variable is the percentage of body fat. 

We compare the results of SCAD, LASSO, Adaptive LASSO and BWA with backward 
elimination to the ANOVA-type procedure with variable screening and SIR, as described in 
Section O Table M\ shows the results for LASSO, SCAD, Adaptive LASSO and BWA. It 
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Table 6: Results for LASSO, Adaptive LASSO, SCAD, BWA 



Predictor 


t Ann / \ 

LASSO 


Adpt. LASSO 


SCAD 


BWA 


Age 


.06499 





aa -i r\r* -i 

.001061 





Weight 





-.09511 


-.11688 


-.1356 


Height 


-.1591 





-.05818 





Neck 


-.2579 











Chest 














Abdomen 


.7079 


.9113 


.9052 


.9958 


Hip 














Thigh 














Knee 














Ankle 














Biceps 














Forearm 


.21756 








.4729 


Wrist 


-1.5353 


-.9871 





-1.5056 



is seen that Weight and Abdomen are selected by all except LASSO. This can be explained 
by the fact that LASSO does not perform well in the presence of highly correlated variables, 
which is the case with this data set. The Adaptive LASSO and BWA give almost the same 
results but differ considerably from those of SCAD. 

For the ANOVA-type method we used SIR with the number of slices ranging from 2 to 
100. Abdomen, Weight, Biceps and Knee were selected with 99, 87, 88 and 23 of the 99 
different numbers of slices, respectively. All other variables were selected less than 15 times. 
On the basis of these results we recommend a model based on Abdomen, Weight and Biceps. 

As an explanation of the fact that Biceps was not selected by any of the other meth- 
ods, we investigated possible violations of the modeling assumptions on which they are 
based. Marginal plots of the response versus each of the important variables reveal both 
hetoroscedasticity and nonlinearity. Moreover, the 99 applications of SIR yielded more than 
one linear combination (i.e. K > 1) 50 times. To put this number into perspective, we 
generated a single set of responses, using the same covariate values with coefficients those 
from Adaptive LASSO and normal errors using the residual variance. Application of SIR 
with the number of slices ranging from 2 to 100 on this data set yielded K = 1 92 out of the 
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99 times. This casts serious doubts on the validity of the assumption of a linear model. 



Appendix 

A Auxiliary Results 

Lemma A.l. Let X±, . . . ,X n be iidfF], and let F n (x) be the corresponding empirical distri- 
bution function. Then, for any constant c, 



sup XiiX \ \F(xi) - F(xj)\I 



\F{ Xi ) - F( Xj )\ < - 



Proof. By the Dvoretzky, Kiefer and Wolfowitz (1956) theorem, we have that Ve > 0, 



P sup \F n (x) - F(x)\ > e < Ce 



-2ne z 



Therefore, \F(x) — F(x)\ = O p ( -^J uniformly on x. Hence, writing 



\F( Xi ) - F( Xj )\ = \F( Xi ) - F n { Xi ) + F n ( Xl ) - F(xj) + F n ( Xj ) - F n (x 



3)\1 



it follows that sup Xi>Xj < \F(xi) — F(xj)\I \F(xi) — F(xj)\ < c/n 



is less than or equal to 



sup Xi)Xj <y\F{xi) - F n (xi) \ + \F n (xj) - F(xj)\} 



+sup Xi>x . < \F n {xi) - F n (xj) \ \ I \F n (xi) - F n (xj)\ < 



n 



This completes the proof of the lemma. 



□ 



Lemma A. 2. With Wi be defined in (TJJ) ; and any Lipschitz continuous function g(x), 



1 n / i \ 

- V g(x 2j )I(j e Wi) - g(x 2i ) =O p [-=), 



uniformly in i = 1, . . . , n. 
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Proof. First note that by the Lipschitz continuity and the Mean Value Theorem we have 



\g(x 2j ) - g(x 2i )\ < M\x 2j - x 2i \ < M\F X2 (x 2j ) - F Xo _(x 2i )\/ f X2 (xij), 



for some constant M, where between x 2 j and x 2i . Thus, 



V 



3=1 
< 



V 



-y n 1 n 

-^2g(x2j)I(] e Wi) -g(x 2l ) < - ^2\g(x 2j ) - g(x 2i )\I 



M^ \F X2 {x 2j )-F X2 {x 2i )\ I 



3=1 



Fx 2 (x 2i ) -F x Jx 2j )\ < 



p — 1 
2n 



P 7=i 



\F X2 (x 2i ) - F X2 (x 2j )\ < 



p — 1 
2n 



Or. 



where the last equality follows from Lemma IA.1I and the assumption that f X2 remains 
bounded away from zero. □ 

As in Wang, Akritas and Van Keilegom (2008), MST-MSE given in ( ITT]) can be written 
as a quadratic form $, V A£ V , where 

np — 1 „ 



.4 



\ n I T I 



(19) 



n(n — l)p(p — 1) * n(n — l)p n(p — 1) 

where 1^ is a identity matrix of dimension d, Jd is a dxd matrix of l's and © is the Kronecker 
sum or direct sum. Using arguments similar to those used in the proof of Lemma 3.1 in Wang, 
Akritas and Van Keilegom (2008), it can be shown that if a 2 (.,x 2 ), defined in Theorem 12. 11 
is Lipschitz continuous and E(ef) < oo then, under H and as n — > oo, 



where A d = diag{B 1 , B n }, with B t = [J p - l p ] 



(20) 



Lemma A. 3. For a symmetric, positive definite bandwidth matrix H n , define the norm 
\\H n \\ to be the maximum of its eigenvalues. Then we have 



J2w(X u ,X lj2 )\\X lj2 - Xu\\ = 0{\\Hl 

32=1 



/2| 
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Proof. Let b be such that K(x) = K{x)I(\ |x| | < y/ d — lb). Such a b exists by the assumption 
that the density K has bounded support. Thus, 

K (H-^(X U - X„)) HXu - X M || = if (^ 1/2 (X l4 - X„)) H^/ 2 ^- 1 / 2 ^^ - Xy)|| 

< # (ir-^pCu - x„)) h^iiii^-v^Xm - X„)|| 

< K {H-^iXu - X y )) \\HV 2 \\Vd=lb. 

The statement of the lemma follows from the above. □ 

B Proofs of Theorems 

Proof of Theorem \2.1[ Under H in @ we write 

& = - mi(X H ) + mi(Xii) - mi(X H ) = ^ - (mi(X H ) - mi(Xii)) 

= & — A roi (Xii), 

where A mi (Xxj) is defined implicitly in the above relation. Thus, £ v of relation (flOj) is 
decomposed as £y = £ y — A mi y, where £ y and A mi y are defined as in (TIP]) but using £j 
and A mi (Xxi), respectively, instead of £j. Thus -^(MST - MSE) can be written as 

Vr&'vMv = Vn&Mv ~ V^^ v AA miV + yfh~A' miV AA miV , (21) 

where the matrix A is defined in ffT9]) . The asymptotic normality of \/ri£yA£y follows by 
arguments similar to those used in Theorem 3.2 of Wang, Akritas and VanKeilegom (2008). 
It remains to derive its asymptotic variance and to show that the other two terms in ( 12TT) 
converge to zero in probability. Using ( 120]) it suffices to find the asymptotic variance of 
y/n£' v A(i£y. Since E(£ v A d £ v ) = its variance equals E[(y/n£ v A d £ v ) 2 }. To find this we 
first evaluate its conditional expectation, E[(y/n£ v A d £ v ) 2 \{X2j} 1 j =1 }, given X 2 i, . . . , X 2n - 
Recalling the notation a 2 (.,X2) = E(C, 2 \X 2 = x 2 ), we have 
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n n n 



rip E E E E (ZhZhtj2th\{Xij}j=i)l(3s e W ls X e W ta ,s = 1,2) 



n(p 



2 



n(p — 


I) 2 


2 




n(p — 


I) 2 


2 




ra(p- 


I) 2 


2 





1)2 E E E ff2 (-' ^^(•.^(i- ' e n ^») ( 22 ) 



*1=1 «2 = 1 J^i 
n n n 



n n n 



n n n / 2 \ 

j 1= li 2 =l Z^j \ / 



3=1 ii=l 12=1 /^j 

2^ 



2 p(p-i)(2p-l) ^ 4 ^ x.^ffM 
= § £ C ' X2j) + ° p J ' 

where the third equality follows from Lemma IA.2I using the assumption that a 2 (.,X2) is 
Lipschitz continuous and the second last inequality results from the fact that if 1 < j 2 \ — 
s < p — 1, then they are (p — s) 2 pairs of windows whose intersection includes ji and j 2 . 
Taking limits as n — > 00 it is seen that 

From relation (122]) it is easily seen that E[(y/n£ v A d £ v ) 2 \{X 2 j}™ =1 } remains bounded, and 
thus Var(ri 1//2 £yA£y) also converges to the same limit by the Dominated Convergence Theo- 
rem. Hence, n^^Mv 

converges in distribution to the designated normal distribution. That 
the second and third terms in fl2Tj) converge in probability to zero are shown in Lemmas IC.14 
IC.2I , respectively. □ 

Proof of Theorem \2.2l Part 1: Local Additive Alternatives 

Note that we can write £j = Yj — mi(Xij) as 

ij = Yj - mifXy) - p n fh 2 (X 2j ) - [mi(X y ) - m^Xy)] + p n fh 2 (X 2j ) 

= Cj~ A mi (Xi y ) + p n m 2 (X 2i ), (24) 
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and therefore $, v = £ v — A mi y + p n m 2 y, where £ v , A miV and m 2 y are defined as in (TTOj) 
but using £j, A mi (Xij) and m 2 (X 2 i), respectively, instead of £j. Thus, we can write 

Vn~(MST - MSE) = ^i v Al v = V^(€v ~ A mi y)'A(£ v - A miV ) + 

+ y/n2p n (£ v - A miV )'Arh 2V + ^p 2 n m' 2V Am 2V . (25) 

By Theorem EB y^(£ y - A miV )'A(£ v - A miV ) 4 N(0, [2p(2p - l)r 2 ]/[3(p - 1)]). That 
y/n2p n {$ >v — A mi y)' Am 2 v 4 and A/np 2 m 2y ylm 2 y 4 a 2 pV(rh 2 (X 2 )) are shown in Lemma 
IC.3I and Lemma IC.44 respectively. This completes the proof of part 1. 

Part 2: Local General Alternatives 

Working as in ([24]) we can write kv = iv ~ A mi y + pi n m 2 y + p 2 „m w , where £ v , A miV , 
m 2V and m 12 y are defined as in (flQj) but using £ i5 A mi (X!j), rh 2 (X 2i ) and m 12 (Xxj, X 2i ), 
respectively, instead of £j. Thus ^/n(MST — MSE) is 

V^y^ty = \M£y - A mi y - pi n m 2 y)',4(£ y - A mi y - pi„m 2 y) 

+ \fn2p 2n {i v - A mi y - pi n m 2 y) / Ami 2 y + y/npl n m' 12V Am 12V . (26) 

By Part 1 of the theorem, y/n(£ v - A miV - p ln m 2V )'A(£ v - A miV - pi n rh 2 y) converges in 
distribution to 

N(a 2 pVar(m 2 {X 2 )), [2p(2p - l)r 2 ]/[3(p - 1)]). 

Hence, it is enough to show that ^/np 2n rh' 12V An\i 2 v 4 pb 2 Var(rhi 2 (X.i, X 2 )) and y/n2p 2n (£ v — 
A mi y — p ln rh 2 y) / y4rh 12 y 4 2pabCov (m 2 (X 2 ), m 12 (X 1; X 2 )). These are shown in Lemmas 
IC.6l and lC.5l respectively. □ 

C Some Detailed Derivations 

Lemma C.l. The second term in [21\) converges in probability to zero, i.e. 

T 2n : = v^'yAA^y A 0. 
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Proof. After some algebra it can be seen that 

v /y\y ) i=1 ^ £W -. fcgVF . 

-1/2 n n —1/2 ™ 

-^T^£ Ami(Xlj) - 7^5> A -( X ^ (27) 

v / i=\ j=i vA* I i=1 

We will show that each of the three terms above converge in probability to zero conditionally 
on the set of observed predictors, {Xj}™ =1 , and thus also unconditionally. Note that, because 
all windows Wj are of finite size p, the first term on the right hand side of (127)1 can be written 
as a finite sum of p 2 terms each of which is similar to the last term in ( 127)) . Thus, it suffices 
to show that the last and second terms of (127)) converge to zero. For notational simplicity, 
all expectations and variances in this proof are to be understood as conditional on {Xy}" =1 . 
For the last term in ( 127)) we have 

n n n 

n-^J^CAmd^ii) = n- 1 / 2 ^^ w (X li ,X lj )(m 1 (X y ) + e j -^i(X li ))e i 
i=i i=i j=i 

n n 



77 

t=l i=l 
n n 



+ n- 1/2 X)Z) u; ( X "' X y)^- ( 28 ) 
i=i j=i 



The first term of the right hand side of ( 1251) has zero expectation, so it suffices to show that 
its variance goes to zero. To this end, we write 

n n 

v i=l j=l 

j n n n 
= ~ ^2 W ( X 1*' X ljiM X l*> X lj 2 ) x 

i=l Jl = l J2 = l 

x(mi(Xij 1 ) - mi(X li ))(m 1 (Xi i2 ) - mi(X H ))^ar(^) 

M - n n 

- — $^w( x H> x iji)w( x ii> x ij 2 ) x 

x(c||Xi jl - Xii||c||Xij 2 -Xij||) 



n 

i=i j 1= i j 2 =i 



Mc 2 0(\\H^\\)0(\\H l J 2 \\), 
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for some constants M and c, where the inequality holds by the assumed conditions for mi(-), 
and the last equality follows from Lemma IA.3I Thus, by the assumptions of Theorem 12.11 
the first term of the right hand side of (1281) goes in probability to zero. To show that the 
second term in ( 128]) also goes to in probability since, we will show that its second moment 
goes to zero. To this end, we write 



E 



n n n n 



n Yl Yl EE &i&2&if*M x iti. Xi jl )w(Xi i2 , X : 

11=1 12 = 1 jl=l J2=l 
,i n 

H x ^ x u) 2 + w ( x i- x iX x i^ x 



i=l j=l 



+ w(X li ,X li )w(X li ,X li )] 

i " 

+ - E(tf)w(X u , X u )w(X u , X H ) 



n 

i=l 
r2 n n 



< Syr 



t^n 2 |# n |/ Xl (x lt ; 



1 2 

+ 



M 2 
+— > 



wiWxi(X 



/xi(Xii) / Xl (Xi^ 
/ 1 \ / 



liJ 



r 1 > 


+ 


f 1 ^ 




#71 




^n 2 







(29) 



for some constants M 1; M 2 and c, by the fact that fi converges uniformly to / a.s. in the 
compact support Sx 1 (Ruschendorf 1977). Thus, by the assumptions of Theorem 12.11 the 
second term of the right hand side of (128|) goes in probability to zero. 



Consider now the second term in (127|) . Since rT 1 ! 2 Y17=i £» remains bounded in probabil- 
ity, its convergence to zero will follow if we show that n' 1 Ylk=i (Xifc) A 0. For later 
use, we will actually show that 



1 n 1 n 

-^^A mi (X lfe ) = -^^(^(X^) -m x (X lfc )) A 0. 



(30) 



k=l 



k=l 



For this we use (cf. Hansen, 2008) 



sup |mi(x) — mi(x)| = O p (a n ) where a r 



i \ V2 



n 



X d-i 



(31) 
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where A — > at the same rate as the eigenvalues Aj, i — 1, . . . , d — 1, of if n . Therefore, the 
term in the left hand side of (I30I) is of order 



n s/4 p \ y nA d-i 

by the assumed conditions stated in (1121) . This completes the proof of Lemma [Cll □ 
Lemma C.2. The third term in [2l\) converges in probability to zero, i.e. 

T 3n = V^A' miV AA miV 4 0. 

Proof. In this proof we will use Wy to denote w(X.u, Xy). Writting 

y/n(np — 1) 



- 3n 



n E 5> mi ( Xl ,) 



^(P^J-^pW, (32) 



we have to show that each of the three terms on the right hand side of fl32|) converges to 
zero in probability. First notice that, because all windows Wi are of finite size p, the first 
term on the right hand side of ( |32i) can be written as a finite sum of p 2 terms each of which 
is similar to the last term in (132|) . Therefore, to show that the first and third terms in (132|) 
go to zero in probability it is enough to show that n~ 1 / 2 ^" =1 A^(X H ) A 0. Using (EI]), it 
is easy to see that 



-1 /9 ' 

n 

i=i 



V2 Y,^^i) <n 1,2 O p {an) 2 = P {1). 



That the second term on the right hand side of fl32l) converges in probability to zero follows 
directly from fl30|) . □ 



Lemma C.3. The second term in Why converges in probability to zero, i.e. 

yfrtlpJXy - A miV )'Am 2 v A 0. 
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Proof. By the definition of the matrix A, we can write (£y — A. mi v)'Am2v as 
np — 1 



n(n — l)p(p — 1) ^ 
1 



Li=i 



^fc-A mi (X lfc ))/(A;eWi) 



fc=i 



n(n — l)p 



P 



1=1 



P ^(ei-A mi (x u )) 



i=i 



— -^m 2 (X 24 )te-A mi (X li )). 



Using Lemma [A. 21 and the fact that rfi 2 (-) is Lipschitz continuous, the sum in the first term 
can be expressed as 



pJ^lMX^ + Oin- 1 / 2 )} 



< 



i=l 



pJ2 

k=l 



< 



A mi (X lfe ))J(A;G Wi) 

i 

(& - A mi (X lfc )) +P 2 0(n" 1 / 2 ) ]T |(& - A mi (X lfe ))| 



i=i 



fc=i 



p 2 ^m 2 (X 2fc )(& - A mi (X lfc )) + O p (pV/ 2 ), 



fc=i 



so that 



^■miV ) 



1 9^ 

an p 
n — 1 



1 9^ 

an p 



n — 1 



~y"m 2 (X 2i )(&-A mi (X li )) 



i=l 



'/I 



^m 2 (X 2i ) 



i=i 



X>-A mi (X u )) 



i=l 



1 



nV4 ; • 



Using the fact that £" (m 2 (X 2i )) = E (rri 2 (X 2 j)£j) = -E(^) = 0, relation fl5D|) and also that 
n -3/4 J^^ =1 m 2 (^2i)A mi (Xii) A 0, as is shown in a similar way to ( 130|) . completes the proof 
of the lemma. □ 

Lemma C.4. The third term in ( fl31) converges in probability to a 2 p\^(m 2 (X 2 )) ; z.e. 



v / np^m' 2l/J 4ra 2 i/ A a 2 pV(m 2 (X 2 )). 



Proof. Writing 



m' 2V Am 2 v 



np 
n — 1 



1 n 1 n 

-^m 2 (X 2t ) 

i=l i=l 



+ 



1/2 



p{£m 2 (X 2 )-[£m 2 (X 2 )] 2 } + 0, 



P \ n l/2 



27 



it follows that 



y/np 2 n m' 2V Am 2V = a 2 pVar(m 2 (X 2 )) + O p \-^J 



which completes the proof. 



□ 



Lemma C.5. The second term in /[2b]) converges in probability to 2pabC (m 2 (X 2 ) , mi 2 (X.i, X 2 )), 
i.e. 

Vn2p2n(€v - A mi y - pi n ra 2V )' Ara. 12V A 2pabCov (rh 2 (X 2 ), mi 2 (Xi, X 2 )). 



Proof. By the definition of the matrix A, we can write 



p ln m 2V )' Am 12V = Vnp 



np — 1 



x 



i=l 



J2mi2(^i j ,X 2j )I(jeW i , 



n(n — l)p(p — 1) 

n 

J2(£ k - A mi (X lfc ) - p ln m 2 (X 2k ))I(k G Wy 



k=l 



-Vnp 



2n 



n(n — l)p 



-Vnp2n ( P ^ y^i2(X u ,X 2 j)(& - A mi (X H ) - pi n m 2 (X 2i )). 



P 



y^ j m 12 (Xii,X 2 i 



i=l 



V 



- A mi (Xii) - pi n m 2 (X 2i )) 



i=i 



(33) 



'n(p- 1) ^ 

Noting that n~ 3/4 ]T™ =1 ^m^X^, X 2i ) A 0, and ^ ^" =1 (X u ) 
which follows by arguments similar to ( 130]) . the third term in (133]) goes in probability to 
[pab/(p - l)] J B(m 2 (X 2 )m 12 (X 1 ,X 2 )). Also, using fl3U]), and the facts £(m 12 (Xi, X 2 )) = 0, 
and 77~ 3//4 X]T=i = °p(l)' t ne second term in ( 133]) goes to pabE(m 2 (X 2 ))E(fhi 2 (X.i, X 2 )) in 
probability. Next, the component of the first term in (133]) that corresponds to 



n n n 



^^^m 12 (X l3 ,I 2j )(a - A mi (X lk ))I(j G G Wi) 

J = l fc=l 8=1 

goes to zero in probability by arguments similar to those used for the last term in (133]) . Set 
rni(X 2l ) = iEi=i^(X 2i )/(j G Wi) and m\ 2 (.,X 2i ) = \ £? =1 mi^Xy, X 2i ) I(j G W t ), so 
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that 



~J2m 12 (X 1:j ,X 2j )I(j G Wi) = m\ 2 {.,X 2i ) + o p (l), 
P 3=1 
1 n 

- ^m 2 (X 2j )I(j G Wi) = fnl(X 2i ) + 0p (l). 



i=i 



The remaining component of the first term in (|33p can be written as 



n n n 



i=l fc=l i=l 

p-l]pa6 1 " ^ ^ 4 ^E[m 12 (X 1 ,X 2 )m 2 (X 2 )] 

Dip — i n ^—^ p — 1 



(n-l)(p 
completing the proof. 



□ 



Lemma C.6. TTie £/izrc£ term in l[26}) converges in probability to pb 2 Var(fhi 2 (Ki, X 2 )) , i.e. 



yJnp 2 2n m! l2V Am 12V -4 p6 2 Var(mi 2 (Xi,X 2 )). 



Proof. Note that we can write ^np\ n \ri l l2V A\x\.i 2 v as 



(np — 1)6 2 
n(n — l)p(p — 1) ^ 

pb 2 
n(n — 1) 



5^m 12 (X y ,^)/(jeWi; 



5Em 12 (X lfc) X 2fc ))/(fce Wi) 



fc=i 



y^mi2(Xif,X 2 i 



8=1 



^ 77112 (X li) X 2i 



i=l 



pb 



n(p 



— — ^mi 2 (Xii,X 2i ) 2 . 



(34) 



i=i 



Clearly, the third term in ( 134)) goes to \pb 2 /(p — l)]-E'[m 12 (X 1 , X 2 ) 2 ] in probability, and the 
second term in ( l3"4"l) goes to p6 2 [i?(mi 2 (Xi, X 2 ))] 2 in probability. Using the same notation as 
in lemma IC.51 the first term in (l34"j) is equal to 



(np l)p6 2 £ ^ 12 (. ; x 2i )] 2 + 0p (l) A ^£[(m 12 (X l4 ,X 2 ,)) 2 ], 



n(n- l)(p- 1) ^ 
completing the proof. 



p — 1 



□ 
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